Lab session 12: K-Fold Cross Validation and Ensemble Stacking for Machine Learning

Name: Makesh Srinivasan
Registration number: 19BCE1717
Course code: CSE4020
Faculty: Dr. Abdul Quadir
Slot: L31 + L32
Date: 15-November-2021 Monday


Instructions:
Use a dataset to perform K-fold cross validation and build an ensemble with KNN and Naive Bayes as first-layer classifiers and logistic regression as the second-layer classifier.

Dataset generation

Load the Iris dataset from the URL 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'. The Iris dataset consists of 3 classes, each with 50 instances. For simplicity, only the first 100 instances, covering 2 of the species classes, are used in this exercise.
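A minimal sketch of this dataset-generation step. The lab loads the data from the UCI URL; here the equivalent bundled copy from sklearn is used so the snippet runs offline, keeping only the first 100 rows (Iris-setosa = 0, Iris-versicolor = 1):

```python
# Sketch of dataset generation. The lab reads the UCI URL (e.g. with
# pandas.read_csv); sklearn's bundled Iris copy holds the same data.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:100], iris.target[:100]  # first two classes only

print(X.shape)   # 100 samples, 4 features
print(set(y))    # classes 0 (setosa) and 1 (versicolor)
```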

Data visualisation:


Structure:

1) K-fold cross validation using KNN classifier (manual)
2) K-fold cross validation using Naive Bayes classifier (sklearn)
3) Ensemble
4) Test the ensemble using new test data

1) KNN Classification (Classifier 1)

KNN functions:
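A sketch of the kind of helper functions a manual KNN implementation needs: a Euclidean distance and a majority-vote predictor. The function names here are illustrative, not the lab's actual identifiers:

```python
# Hypothetical manual-KNN helpers: distance metric plus majority vote.
import numpy as np
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    dists = [euclidean(t, x) for t in X_train]
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    votes = [y_train[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]
```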

K-fold cross validation using KNN:

Assumptions/given:
The number of folds is set to 10
K value in KNN is set to 3
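The manual cross validation loop can be sketched as follows, under the assumptions above (10 folds, K = 3). sklearn's KNeighborsClassifier stands in here for the lab's hand-written KNN functions:

```python
# Manual 10-fold cross validation: shuffle, split into 10 folds by hand,
# and let each fold serve once as the held-out test set.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X, y = X[:100], y[:100]                      # two classes, as in the lab

rng = np.random.default_rng(0)
idx = rng.permutation(len(X))                # shuffle before folding
folds = np.array_split(idx, 10)              # 10 folds of ~10 samples each

accs = []
for i in range(10):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(10) if j != i])
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))

print('mean accuracy:', np.mean(accs))
```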


The same procedure can be performed with the sklearn package. The second classification model, Naive Bayes, is implemented using sklearn below.


2) Naive Bayes (Classifier 2)

10-fold cross validation using Gaussian Naive Bayes
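With sklearn this step reduces to a single call. A minimal sketch, assuming `cross_val_score` with `cv=10` as the cross validation routine:

```python
# 10-fold cross validation of Gaussian Naive Bayes via sklearn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X, y = X[:100], y[:100]                      # first two classes only

scores = cross_val_score(GaussianNB(), X, y, cv=10)
print('fold accuracies:', scores)
print('mean accuracy:', scores.mean())
```

With an integer `cv` and a classifier, sklearn uses stratified folds, so each fold keeps the 50/50 class balance of the subset.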


The individual classifiers are complete. Now we can stack them to form an ensemble.


3) Stacking

Stacking is performed using the two classifiers above (KNN and Naive Bayes), with logistic regression as the meta-classifier.

The objects of the two classifiers are created

Create an ensemble model using the estimators generated above, with logistic regression as the second-level classifier:
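A sketch of the ensemble construction, assuming sklearn's `StackingClassifier` (an mlxtend `StackingClassifier` would be an equivalent choice):

```python
# Stacked ensemble: KNN and Gaussian Naive Bayes as first-layer estimators,
# logistic regression as the second-layer (meta) classifier.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

estimators = [
    ('knn', KNeighborsClassifier(n_neighbors=3)),
    ('gnb', GaussianNB()),
]
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression())
```

Internally, `StackingClassifier` trains the base estimators with its own cross validation and feeds their out-of-fold predictions to the logistic regression as features.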

Ensemble Kfold cross validation:

The 10-fold cross validation of the ensemble classifier is done below.
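The evaluation mirrors that of the individual classifiers: the stacked model is passed to `cross_val_score` with `cv=10`. A self-contained sketch:

```python
# 10-fold cross validation of the stacked ensemble.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X, y = X[:100], y[:100]

stack = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier(n_neighbors=3)),
                ('gnb', GaussianNB())],
    final_estimator=LogisticRegression())

scores = cross_val_score(stack, X, y, cv=10)
print('mean ensemble accuracy:', scores.mean())
```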


CONCLUSION:

The two individual classifiers (KNN and Naive Bayes) each gave an overall average accuracy of 1.00. Since accuracy cannot exceed 1.00, the best the ensemble can do is match this, and the stacked model using logistic regression indeed also achieves 1.00, as shown above.


4) Testing the ensemble classifier model using new data

KNN (individual prediction)

NOTE: for all values of K, the prediction is the same: Iris-setosa (0)

Naive Bayes (individual prediction)

Ensemble stack (overall prediction)
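A sketch of this final step: fit the ensemble on the 100-sample subset and predict a new, unseen flower. The sample measurements below are illustrative setosa-like values, not the lab's actual test point:

```python
# Predict the class of a new sample with the fitted stacked ensemble.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X, y = X[:100], y[:100]

stack = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier(n_neighbors=3)),
                ('gnb', GaussianNB())],
    final_estimator=LogisticRegression()).fit(X, y)

new_sample = [[5.0, 3.4, 1.5, 0.2]]          # illustrative setosa-like values
print(stack.predict(new_sample))             # 0 -> Iris-setosa
```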


Predictions on the test data:
1) KNN: Iris-setosa
2) Naive Bayes: Iris-setosa
3) Ensemble: Iris-setosa


The ensemble prediction is Iris-setosa (0), which agrees with the true value suggested by the visualisation in the KNN section above.

Therefore, we can conclude that the ensemble performs at least as well as the individual classifiers.